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Abstract. We present a recursive way to partition hypergraphs which creates and exploits 
hypergraph geometry and is suitable for many-core parallel architectures. Such partitionings are then 
used to bring sparse matrices in a recursive Bordered Block Diagonal form (for processor-oblivious 
parallel LU decomposition) or recursive Separated Block Diagonal form (for cache-oblivious sparse 
matrix-vector multiplication). We show that the quality of the obtained partitionings and orderings 
is competitive by comparing obtained fill-in for LU decomposition with SuperLU (with better results 
for 8 of the 28 test matrices) and comparing cut sizes for sparse matrix-vector multiplication with 
Mondriaan (with better results for 4 of the 12 test matrices). The main advantage of the new method 
is its speed: it is on average 21.6 times faster than Mondriaan. 
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1. Introduction. With tlte increased development and availability of many-core 
processors (both CPUs and CPUs) it is important to have algorithms that can make 
use of these architectures. To this end, we present a new recursive hypergraph par- 
titioning algorithm that uses the underlying geometry of the hypergraph to generate 
the partitioning (which is largely done using shared-memory parallelism). Hyper- 
graph geometry may either be provided from the problem at hand or generated by 
the partitioning software. This entire process is illustrated in Fig. 

1.1. Hypergraphs. We will start with a brief introduction to hypergraphs (see 
[6] for more information) and the ways in which they can be related to sparse matrices. 

Definition 1.1. A hypergraph is a pair G — (V,E) where V is a set, the 
vertices of the hypergraph G, and E a collection of subsets of V (so for all e € E, 



e CV), the hyperedges or nets of G (see Fig. 1.2). We call a hypergraph G = {V,E) 
weighted when it is paired with functions w : V ^ [0,oo[ and c : E ^ [0, c»[ which 
assign weights w{v) > and costs c(e) > to vertices v d V and hyperedges e € E, 
respectively. We call a hypergraph G = (V, E) simply a graph if all hyperedges e d E 
are of the form e = {v, w} with v,w G V ; in this case the graph is undirected and the 
hyperedges are called edges. We call a hypergraph G — {V, E) finite if V is a finite 
set, in which case \E\ < 2l^l, so E is finite as well. 

Hypergraphs possess a natural notion of duality, where the roles played by the 
vertices and hyperedges are interchanged. 

Definition 1.2. Let G — (V,E) be a hypergraph. 

Then we define its dual hypergraph as G* {V*,E*) where V* :— E and 

E* := {{e e E \ e3 v} \ V £ V}. 

If the hypergraph is weighted, we also exchange the vertex weights and hyperedge costs 
to make its dual a weighted hypergraph. 

The dual of the dual of a hypergraph is isomorphic to the original hypergraph, 
(G*)* ^ G. 
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Figure 1.1. The matrices twotone (top) and ford2 (bottom). For the original matrix (left), a 
visual representation is created (middle), which in turn is used to permute the matrix to recursive 
Bordered Block Diagonal form (right). 




Figure 1.2. Hypergraph G = (V, E) mt/t V = {1, 2, 3, 4, 5} and S = {{1}, {1, 2}, {2, 3, 4}, {3, 4}}. 



One direct application of hypergraphs is to use them to represent sparse matrices, 
see Table [T3] We can make a number of observations about these representations. 

1. The symmetric representation is only sensible if the matrix is structurally 
symmetric, because only then we can recover the nonzero pattern of the 
original matrix from its representation. 

2. The bipartite representation is a bipartite graph with the two parts consisting 
of the rows (ri, . . . , r™) and the columns (ci, . . . , c„) of the matrix. 

3. The symmetric and bipartite representations are both undirected graphs in- 
stead of hypergraphs and the size of the symmetric representation is about 
half that of the bipartite representation. 

4. The column-net and row-net representations are each other's dual. 

5. The finegrain and bipartite representations are each other's dual. This is also 
reflected in the fact that the hyperedges of the finegrain representation can 
be partitioned into two disjoint sets (the row and column hyperedges). 

We will make use of these observations in Section |4l 
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Name 


V 


E 


Symmetric |30| 


|1, . . . , ml 


III, 7| 1 1 < i < TO, 1 < 7 < a, o ^ 01 


Bipartite [21] 


{^1 ; ■ ■ ■ ; ; ^1 ; ■ ■ ■ 1 '^n } 


{{ri, Cj} 1 1 < i < m, 1 < j < n, ^ 7^ 0} 


Column-net [7] 


{n, . . . ,rm} 


{{r, 1 1 < i < TO,a,j 9^ 0} 1 1 < J < n} 


Row-net T] 


{ci, . . . ,c„} 


{{cj 1 1 <j <n,a,, ^0} 1 l<i <to} 


Finegrain [8] 


{fij 1 flij 7^ 0} 


<i<m,a,, 7^0} | 1 < J < n} 

v ' 

column hypcrcdgcs 

U{{f»j|l < J <ri,a,j 7^0} 1 l<i <to} 

V ' 

row hypcrcdgcs 



Table 1.1 



Several common representations of an m X n-matrix A = (a; j) by a hypergraph G = {V, E), 



1.2. Visual representations. As stated in Section [T] we would like to exploit 
the underlying geometry of a hypergraph, usually representing a sparse matrix. 

Definition 1.3. Let G = {V,E) be a given hypergraph. Then a visual represen- 
tation of G m d € N dimensions is a mapping 

that reflects the structure of the underlying hypergraph. 

This definition is not very precise and therefore we will illustrate it by looking at a 
few examples. Sometimes the visual representation of a matrix is directly available, for 
instance if the matrix is based on a triangulated mesh, such as the following Example 



1.4 In other cases, the visual representation can be generated from the problem that 
the matrix represents, see Example |1.5| If no such information is available at all, we 
generate the visual representation ourselves, as discussed in Section [2j 

Example 1.4. The pothen collection of matrices, available from JJW, consists of 
NASA structural engineering matrices collected by A. Pothen. We will take a closer 
look at the square pattert^ matrices from this collection, which have a natural visual 
representation. Each row/column of these matrices corresponds to a vertex (the coor- 
dinates of which are supplied in a separate file) and each nonzero to an edge between 
the vertices corresponding to the row and column to which the nonzero belongs. We 
can take a look at the matrices and their corresponding visual representation by plot- 
ting these vertices, as is done in Fig. \1.3\ These vertices give a visual representation 
of the symmetric hypergraph representation of the sparse matrix. 

Example 1.5. Another important example of matrices with a natural visual 
representation are those arising in finite-element methods. In this example, we will 
consider the Laplace equation on a compact smooth d-dimensional closed submanifold 
ri C R"* with compact smooth boundary dft. Let 

V:^{fe C°°{n,Il) I /, IIV/II e L^{n,Il), flon = 0} 

be the collection of all smooth real-valued functions on that are square-integrable, 
have square-integrable derivative, and vanish on the boundary dQ. Here the inner 
product is given by 

{u,v) = / u{x)v{x)dx+ / \/u{x) ■ \7v{x) dx. 
Jn Jn 



^Nonzero entries have numerical value 1. 
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Figure 1.3. From left to right: the matrices spheres, pwt, and commanche_dual with their orig- 
inal meshes (top, see Example \1.4}/ and visual representations created using the techniques described 
in Section^ (bottom, also compare with \24.^ ) . 

Let g ^ V be given. We consider the problem of finding an f Cz V such that 

Af{x) = ~g{x), (xen). (1.1) 

The first step in the finite-element method is to rewrite the above problem into its 
weak formulation Let h G V and suppose f is a solution to eqn (1.1), then using 
integration by parts 

g{x)h(x)dx ^ - Af{x)h{x)dx = -0+ / W f{x) ■ Wh{x) dx. 
n Jn Jn 

Hence, every solution f necessarily satisfies the weak formulation of this problem: 

a{f,h)^bih), {heV), (1.2) 

where 



a{u,v) :— / Vu{x) ■ Vv{x) dx, b{u) := / g{x)u{x)dx. 
Jn Jn 

Lf we now choose functions hi, ... , h„i G V , we can look at the approximate solu- 



tion <j) to eqn [ 1.2) in the subspace Vm '■= {hi, . . . , h.m)^, Q V spanned by hi, ... , hm 
Because (p G Vm, there exist coefficients (pi, ... , </>,„ G R such that (p — 4>i 
and therefore, 

/ m \ m 

b{hj) = a{(p, hj) = a i'^(j)ihi,hj \ = ^ (pi a{h^, hj), (1 < j < m). 



i=l 



^If the weak formulation satisfies the conditions of the Lax-Milgram theorem, the solution / to 
the weak formulation is unique and hence also the solution to the original problem, provided such a 
solution exists |26| . 



So solving eqn U^W in Vm for cj) amounts to solving the linear system 



a{hi,hj)(j>i = b{hj), (1 < j < m), 



for {(pi, . . . , 4>„i) € R™. By choosing hi, ... , h^ such that their supports only have a 
small overlap we create a sparse matrix A G R"'^™ with entries aij := a{hj, hi) that 
are nonzero only for the I < i, j < m where supp hi D supp /i^ ^ 0. This can be done, 
for instance, by triangulating fl and choosing for each vertex i in the triangulation a 
function hi with support contained in all triangles adjacent to the vertex i. 

We can now directly create a visual representation for the various hypergraph 
representations of the matrix A (Table \Tl^ : 

1. for the symmetric, bipartite, column-net, and row-net representations, we 
map each vertex i, ri, and Ci to the center of supp hi in R'^ , 

2. for the finegrain representation, we map each vertex Vij to the center of 
supp hi n supp hj in H'^ . 

Thus, in this case, we can generate a visual representation from our original problem 
with little extra effort. 

However, not all matrices have an immediate geometric origin (e.g. twotone), so 
we would like to be able to create such a visual representation ourselves if it cannot be 
provided directly. Hence, we will continue by discussing how to create such a visual 
representation in Section [2j In Section [3j we show how to use visual representations 
for hypergraph partitioning, and in Section |4] we apply this in the context of sparse 
LU decomposition. Wc conclude by implementing these methods and comparing them 
with both SupcrLU and Mondriaan in Section [5] 

2. Creating visual representations. To create visual representations, we em- 
ploy the method described in ^71[M1I32], generalized to visualizing hypergraphs: we 
let the vertices of the hypergraph repel each other by an electrostatic-like force and 
let the hyperedges bind their vertices together like rubber bands. We model this as 
the minimization of an energy function where we lay out the graph in at least three 
dimensions (to prevent having to treat the special cases d = 1 and d = 2 where the 
form we propose for the energy function is not appropriate). For brevity, we will 
simply enumerate the vertices of the hypergraph we consider, thus assuming that 
V ^ {1, . . . ,k} for some k £N. 

Definition 2.1. Let G = ({1, . . . , fc}, {ei, . . . , e/}) be a finite weighted hypergraph 
with vertex weights Wi, . . . ,Wk > 0, hyperedge costs ci, . . . , c/ > 0, and let d > 3 be 
the dimension of the target space. Define 

U {{xi, ...,Xk)e R"'"' \ yi<i<j <k:x,^ x,}. 

Then the energy function of G is defined as 

fi-^^ f E E 11-^ - -.ir + ^ E E (2-1) 

j = l i£ej i = l i=l " * ^ " 

V „ ' 



rubber bands 

repelling charges 



for (xi, . . . ,Xk) (z U, where a,/3,"f,d > are constants, and for 1 < j < I the center 
of hyperedge Cj is defined as 



1 



Now, we will generate a visual representation by finding 



argniin{/(a::i,...,a::fc) | {xi,...,Xk) e U}, (2.2) 
where / is the energy function of our hypergraph G. Approximate solutions to eqn 



(2.2) generated by our algorithm are shown in Fig. 1.3 



2.1. Constants. We can tinker with / by varying the constants a, /3, 7, and 5. 
First, we can make a number of observations about U , /, and their symmetries. 

1. {{R{xi) + x, . . . , R{xk) + x) I (xi, Xfc) e [/} = [/ for all a; e R'' and all 
orthogonal transformations R G 0(R''). 

2. U is an open subset of R''^'= and U = 9 U ior all 9 e R\ {0}. 

3. f{R{xi)+x,...,R{xk)+x) = f{xi,...,Xk) for all x Re 0(R'^). 

4. / G C{U, [0, oo[ ) is continuous; / is always bounded from below by 0; and if 
A; > 2, / is unbounded from above. 

So U and / are invariant under all translations and orthogonal transformations 
in R"^, but while U is scaling-invariant, / in general is not. 
Let 9 > and (xi, . . . , x^) G U. Then 

(x 9"^ 13 
f{9xi,...,9xk) = 11^* ~ ^jll'^ + JgsJlJl^^^J \\x,-x^\\-\ 

j=l ieej j = l i=l 

So in particular, for all (xi, . . . , Xk) € U we have that 




where / is given by eqn (2.1 1, but with a ~ (3 = 1. Therefore, we can pick a = /3 = 1 
without loss of generality; this just scales the minimum of / by an overall factor (a 
similar scaling property is derived in 1241 Theorem 1]). 

The function / is continuously differentiable for 7 > 2. Calculating the partial 
derivatives of /, we find that for l<m<d, l<n<k, 

df{xi,...,Xk) _ ai ^ ^ V llr - 7 ll''-2rT -r \ 

r, — „ |p I II * Jll ™« 

k 

aj „ .) . /3^V (x -X ) 

\ / ^ ^3 W'^'Ti '^3\\ y-^mn -^m j ) ~ ^ y ^ S-\-2 V ^rii ^mn ) • 

{3 n^^j} 



repelling charges 

(2.3) 



Note that eqn (2.3 ) simplifies considerably if we pick 7 = 2, as X^iee {xmi — 

^rnj) — {J2i£e ^mi) — l^j \ Zm j ~ 0, which makcs the first term disappear, and ensures 
that / G C°°{U, [0,00 [) is smooth. This motivate s us to pick 7 = 2. 

Calculating the repelling-charges part of eqn (2.3) for n = 1, . . . , A: requires 0{k'^) 



evaluations which is too expensive for the hypergraphs we envision, with a huge 
number of vertices k. Luckily, we can circumvent this problem using techniques from 
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[24] and [29j: we will build the d-dimensional equivalent of an octree to group the 
repelling charges into clusters and treat far-away clusters of charges as a single, but 
heavier charge. To ensure that this works properly the location of this larger charge 
will be the weighted average location of the cluster and the weight of the charge will 
be set equal to the sum of the charge weights in the cluster. Note that the repelling- 



charges part of / in eqn ( |2.1[ ) consists of applications of the map x i— >■ ||a;||~* for 
X € R"*, the Laplacian of which is given hy x ^ 5 {5 — [d — 2)) ||a;||^''^^. Hence, we 
can ensure that this part of / is harmonic (i.e. with vanishing Laplacian) by choosing 
5 = d ~ 2. This will in turn make our energy function behave well when treating 
clusters of far-away charges as a single, heavier charge, because of the mean-value 
property of harmonic functions (see [3j Theorem 1.6 and 1.24]). 
Recapitulating the above: 



1. we can pick a = /3 = 1 since this only scales a solution to eqn (2.2 1, 



2. we should pick 7 = 2 to be able to calculate the rubber band contribution to 



eqn (2.3) efficiently. 



we should pick 6 ~ d — 2 to be able to approximate the repelling charge 



contribution to eqn (2.3) by treating clusters of charges as a single, heavier 
charge. 

Inserting these constants we obtain the following expressions for / and its partial 
derivatives: 



; , fc fc 



— ■ - ■ =1 i=l II ' 1" 



j = l igej j= 



fc 



(2.5) 



2.2. Connectedness. Before we start describing an algorithm to solve eqn ( |2.1 
we should take care to ensure that such a solution actually exists. 



Theorem 2.2. Let G = iV,E), U, and f be given as in Definition 2.1 
suppose G is non-empty. 

Then precisely one of the following statements is true: 



1. there exists a solution [xi, . . . , Xk) to eqn (2.2), 

2. G is disconnected: we can write V — V1UV2 as a disjoint union with Vi,V2 ^ 
0, such that for all e ^ E we have e CVi or e CV2. 

Proof. Suppose G is disconnected. Without loss of generality, we can order the 
vertices such that Vi = {1, . . . , fc'} and V2 = {fc'-l-l, . . . , fc}. Suppose {xi, . . . ,Xk) € U 
has minimum energy. Then for all y £ IV^ \ {0} we have 



f{xi +y,...,xk' +y,xk' -y,...,xk-y) 

k' k 



fixu...,xk)+pJ2 E 



Wi Wj Wi Wj 



— ' ^-^ \\\2y + Xj~xA\^ ||x,- — a;,,|| 
i=i j=fc'+i Ml " * 1" II * Ji 



This equation holds because all hyperedges Cj e E are either completely contained 
in Vi or completely contained in V2, such that all Xi with i e Cj are either translated 
by +j/ or — y, respectively. This ensures that for all hyperedges Cj G E and vertices 
i € ej, the difference Xi — Zj remains the same, which in turn leaves the first term of 



eqn (2.1 1 unchanged. For the second term of eqn (2.1 1, we find that the difference 
Xi — Xj only changes if i G Vi and j (z V2, or vice versa. 
For all 1 < i < k' , k' < j < k, we have 

,. WiWj 1 WiWj 
hm T—— r — lim r-rr 7 — s , — 



) — ^00 



\2{ry) + Xi ~ x-iW^ r^oo jrl* \\2y + {xi - Xj)/r\\^ 



as 1/ 7^ and i5 > 0. In particular there exists an r > such that for all 1 < i < k' , 
k' < j < k we have 

Wi Wj Wi Wj 



So for this r we have 



< 0. 



f{xi +ry,...,Xk> +ry,Xk'+i - r y, . . . , Xk - r y) < f{xi,...,Xk), 

hence (xi, . . . ,Xk) does not have minimum energy; we have reached a contradiction. 
Therefore, no solution to eqn (2.2 1 exists. 

Suppose conversely that G has no disconnected components: for any disjoint union 
V = ViUV2 with Vi,V2 ^ there exists an e e E such that eH Vi 7^ and eH ^2 7^ 0- 
So in particular for every v,w G V there exists a path of hyperedges ei, . . . , e„ G E 
such that V G ei, w G Cn, and ej-i Ci Cj 7^ for all 1 < j < n. We can see this by 
writing V as the disjoint union {v} U {V \ {v}) which gives us ei and continuing by 
induction to obtain e^+i from the disjoint union V = (eiU. . .Uej)U(F\(eiU. . .Uej)) 
until we reach w (which will happen eventually as each new hyperedge adds at least 
one new vertex and our hypergraph is finite). 

Suppose that for a given R > 0, {xi, . . . , Xk) G U satisfies 

max ll^i — XjW > R. (2-6) 



We will now focus on two vertices for which the relative distance in eqn (2.6 1 is 
maximal. As described above there exists a path of hyperedges between these two 
vertices. So there exist vertices ii, . . . , i„+i G V and edges ei, . . . , e„ G E such that 
\\xi^ — 2;i„^J| is maximal, ii G ei, G e„, and i-j G ej-i D ej for 1 < j < n. Along 
this path, we have 



R<\\x^ 



< llx,, -Zi\\+ V(||Xj„ - Zj-l 



m=2 



Hence, one of these 2n terms must be at least R/{2n). So there exist a vertex 
p G V and a hyperedge Cq G E such that p G Cq and \\xp — Zq\\ > R/{2n) > R/{21). 
Therefore, / satisfies 

f{xi,...,Xk)>'^Cq\\Xp-ZqP>j^jyj ^mjn Cj^i?^. (2.7) 

Suppose that for a given r > 0, (xi, . . . , Xk) G U satisfies 

min 1 1 — Xj 1 1 < r. 

l<i<j<k 



Fix 1 < i < j < k such that || minimal, then 

Note that / is invariant under translations, so that without loss of generality we 
can restrict ourselves to solutions with Xi = 0. Let 

E /((0, 0, . . . , 0), (1, 0, . . . , 0), (2, 0, . . . , 0), . . . , (fc - 1, 0, . . . , 0)) 

be an upper bound for the minimum value of /. 

By eqn (2.71 we know that there exists an i? > such that f{0,X2, ■ ■ ■ ,Xk) > E 
whenever max^^j ||a;i — > R. On the other hand, by eqn (2.8| there exists an r > 
such that /(O, X2, ■ ■ ■ , Xk) > E whenever min^^j ||a;i— < r. So if (0, X2, • ■ • , x^) G U 
has minimal energy, then necessarily (0,X2, ■ ■ ■ ,Xk) € C where 

C:^ fl {{xi^0,X2,...,Xk)e-R'''"' \r<\\x,-x,\\<R}. 

l<i<j<k 

As C C R.'^x*: is both closed and bounded (because xi = 0), C is compact by Theorem 
1.8.17 in [TB]. We also have that C C U and / is continuous on U, so f is continuous 
on the compact set C and therefore f\c attains its minimum at a certain point in 
C by Theorem 1.8.8 of [16j . Since the minimum of / necessarily lays within C by 
construction, this is a solution to eqn (12. 21). □ 



Therefore, in solving eqn (2.2), we should restrict ourselves to connected hyper- 
graphs. If the provided hypergraph is not connected, we have to treat each connected 
component separately, where a solution is guaranteed to exist by Theorem |2.2[ 



Algorithm 1 Determines the connected components of a hypergraph G = {V,E), 
with V = {1, . . . , k}. Vertices v and w belong to the same connected component of G 
iS Pv = Pw This algorithm is adapted from [TOl Section 21.3]. 
1: for all V eV do 

2: py v; 

3: for all e e £^ do 

4: for all ti e e do 

5: while py 7^ pp^ do 

6: py ^ pp^^ ; 

7: p -S— min{p„ I w e e}; 

8: for all w G e do 

9: Pp„ ^ p; 

10: for all w e y do 
11: while py 7^ pp^ do 

12: py ^ pp^ ; 



2.3. Algorithm. We use the above observations to create a multilevel algorithm 
which approximates a minimum-energy solution. Firstly, we find the connected com- 
ponents via Algorithm [T] This algorithm works by creating a forest, i.e. a collection 
of rooted trees, where the parent of a vertex v € V is denoted by py e V, and each 
root satisfies py = v. The algorithm merges the trees of all vertices contained in a 
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Algorithm 2 Finds solutions to eqn (2.2) for a given connected hypergraph G 
{V,E), adopted from [24]. 

while G^ is too large do 

coarsen G^ to G^^^ with surjective tt^ : Vgj — > Vgj+i; 



let be a random visual representation for G^ 

(so pick x-'(w) e R*^ at random for all vertices v € Vq]); 

while J > do 

improve by Algorithm [sj 
if j > then 

for all vertices v e Vqj^i parallel do 

x'^^^{v) -s— scale x^{'k^{v)) + small random displacement; 
j ^ j - 1; 



hyperedge, for all hyperedges. After this is completed, each vertex is directly attached 
to its root, which represents its connected component. 

Secondly, we apply Algorithm [2] to each connected component to generate a visual 
representation. We create a hierarchy of coarsenings of the hypergraph and visual 
representations for these coarser hypergraphs, to avoid getting stuck in local energy 
minima. Here, we follow the graph visualization algorithm from 24 . We coarsen 
hypergraphs by creating a maximal matching that is constructed greedily from heavy 
(high cost) hyperedges, as heavy hyperedges have the tendency to pull the vertices 
contained in them closer together when we try to find the energy minimum, and 
then merging the matched vertices. To prevent star hypergraphs from forming (which 
disrupt this matching procedure [12] ) we also always match single-neighbor vertices to 
their neighbor. Coarsening a hypergraph G to a hypergraph H in this way, we obtain 
a surjective map tt : Vg — >■ Vh which maps each collection of matched vertices of G 
to a single vertex of H. We also merge hyperedges e,e' e Eq satisfying 7r(e) — 7r(e') 
(where 7r(e) = \ v S e}) to a single hyperedge and set the cost of this new 

hyperedge to the sum of the costs of e and e'. At line 10 of Algorithm [2] we introduce 
the parallel do construct. We use it to denote parallel for- loops that can directly be 
parallelized because their iterations are independent. 

For the coarsest version of the hypergraph, we start out with a random visual 
representation, which is then improved by Algorithm [3] using the steepest descent 
method. After the visual representation has been improved for the coarse hypergraph 
we scale it and add small random displacements, to ensure that different vertices in 
the fine hypergraph do not occupy the same position. Thus, we obtain a visual repre- 
sentation for the fine hypergraph. This layout is then again improved by Algorithm 
[3j We continue doing this until we obtain a visual representation for the original 
hypergraph. 

Algorithm [3] gives the details of the improvement procedure on line[8]of Algorithm 
[2] The part of the gradient of the energy function / belonging to Xn, is denoted by 
Un. We create a tree T which recursively groups the points belonging to the visual 
layout; this tree consists of nodes t G T which have position xt £ IV^ (the average 
position of all points contained in t) and weight Wt > (the sum of the weights 
of all vertices contained in t). The functionality of siblings of nodes in the tree is 
extended by letting the root have a dummy node as sibling (denoting the end of the 
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Algorithm 3 Improves a given approximate solution {xi, . . . ,Xk) G [/ to eqn (2.2) 
for a hypergraph G = ({1, . . . , fc}, {ei, . . . , e;}) using steepest descent. 



1 
2 
3 
4 
5 
6 
7: 
8 
9 
10 
11 
12 
13 
14 
15 
16 
17 
18 
19 
20 
21 
22 
23 
24 



while we are not satisfied with the solution do 
build tree T recursively clustering xi, . . . ,Xk', 
for n = 1 to fc parallel do 

Vri ^ 0; 
for j — 1 to I parallel do 

for n = 1 to fc parallel do 

for all j such that n G ej do 



Vn ^ Vn + Cj {Xn - Zj)\ (cf. CqU ( [23| ) 

for n = 1 to fc parallel do 

t -s— root of T; 
while t 7^ dummy do 

if Xn is far away from t then 

yn^yn + {d- 2) WtWn \\xt - X„||~'' {xt - Xn)] (cf. CqU ( [Z^ ) 

t sibling of t] 
else if t has a child then 

t ^ child of i; 
else 

for all i e i, i 7^ n do 

Vn ^ Vn + {d-2) Wi Wn \\xi - a;„||"'^ {xi - Xn); (cf. eqn ( |2^ ) 
t ^ sibling of t; 
determine appropriate stepsize a > 0; 
for n = 1 to fc parallel do 

Xn Xn ^ Vn i 



tree traversal), and letting nodes without siblings have the sibling of their parent 
as sibling. This facilitates a fast, direct tree traversal without backtracking [55] ■ 
During steepest descent, we determine the step size by comparing the bounding box 
volume to the individual gradients of the vertices to prevent blowup for the first 
few steps and then decrease the found stepsize by multiplying it by 0.9 ^24^. Note 
that apart from the building of the tree grouping the hypergraph vertices, almost all 
parts of Algorithm [3] consist of d-dimensional floating point arithmetic that is directly 
parallellizable over all vertices and hyperedges. This makes Algorithm [3] suitable for 
many-core architectures such as GPUs. 

3. Partitioning. Suppose that xi, . . . ,Xk € R'' is a visual representation of our 
hypergraph G = ({1, . . . , fc}, {ei, . . . , e/}), obtained either directly or by using the 
methods from Section [2] Then we will use this spatial layout to create a partitioning 
of the hypergraph in a desired number of to e N parts. To do so, we employ the 
k-means++ method [2], given by Algorithm |4j This algorithm searches for to centers 
Zi, . . . , z„i S R*^ such that 

k 

V min \\x^-Zj\\^ (3.1) 

^ — ^ l<j<m 
i—1 

is minimal, which is NP-hard for all to > 2 [T]. The advantage of k-mecLns++ is that 



the algorithm finds zi, . . . ,Zm in 0{km logm) time, while the value of eqn (3.11 is 
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expected to be within a factor of 8 (log m + 2) from its minimum value already at the 
start (line [6]) of the first iteration [5]. The k-meELns++ algorithm therefore permits 
us to isolate clusters quickly in the visual representation of our hypergraph, which 
correspond to highly interconnected patches of vertices in the hypergraph. Algorithm 
|4] is furthermore easily parallelized in shared memory (lines |3] and [7|. The sum at 
line [TT] can also be performed in parallel by computing partial sums for the Zj and 
summing these afterwards. 



Algorithm 4 The k-means++ algorithm finds cent ers z i, . . . , Zm G R-'^ for a given 
set of points xi, . . . ,Xk € R'', trying to minimize eqn (3.1 ). 



set zi to a randomly chosen point from {xi, . . . , ccfc}; 
for n — 2 to m do 

for i = 1 to A; parallel do 

^ mini<j<„ \\xi - Zjjp; 
choose Zn to be equal to Xi with probability di / (c?i 
for a fixed number of iterations do 
for i = 1 to fc parallel do 

argmini<j<„||a;j - Zj 
: 1 to m do 
= {l<t<k\ 



Ji ^ 
for j 



2. 



I C.I 



dk); 



ieCj 



The m disjoint subsets Ci, . . . , Cm ^ V produced by Algorithm [4] form an m-way 
partitioning of V. It should be remarked that this way of generating a partitioning 
does not enforce balancing of the partitioning, but in general k-meELns++ does a good 
job of dividing the point set into parts of approximately equal size: large groups 
of points pull centers harder towards themselves, which enlarges the other, smaller, 
groups. Such balancing can be observed in Fig. |l.l| and Fig. |5.1| Partitionings gener- 
ated by Algorithm |4] can further be improved by subjecting them to a few iterations 
of the Kernighan-Lin algorithm [27 . 

4. LU decomposition. We will now apply the ideas discussed in the previous 
sections to performing an LU decomposition of a given matrix in parallel using nested 
dissection [H [H] , see also [1 [HI UHl 123 121] ■ 

Definition 4.1. Let A G C™^™ be a given mxm matrix. Then a (permuted) 
LU decomposition of A is a decomposition of the form 

PAQ = LU, (4.1) 

where P,Q E {0, l}"'^*'" are permutation matrices and L,U E c™^™ with entries Uj, 
Uij respectively, such that Uj =0 for i < j , Ui j — for i > j , and la — 1 for all i. 

We include permutations in Definition |4.1| to ensure that well-behaved matrices 
like ( 5 ) also have an LU decomposition. Calculating an LU decomposition for a given 
matrix with complete pivoting is described by Algorithm |5] While this algorithm is 
fine for small dense matrices, we get into trouble with large sparse matrices for two 
reasons: Algorithm [5] takes 0{m^) iterations (which scales rather badly), regardless 
of matrix sparsity, and as illustrated by Example |4.2[ the sparsity of the original 
matrix may be lost during the decomposition, requiring up to 0{m'^) memory. We 
will remedy these problems by calculating appropriate permutation matrices P and 
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Algorithm 5 LU decomposition (algoritiim 3.4.2 from [19]). Determines for a matrix 
A e C"'^™ with entries Uij factors P, Q, L, and U such that P AQ ^ LU. 

1: for fc = 1 to m do 

2: TTj. k, ak k; 

3: for fc = 1 to TO do 

4: find k < i, j < m such that |aij | is maximal; 
5: swap rows k and i and swap tt^ and tt^; 
6: swap columns fc and j and swap and aj ; 
7: if Ofe fe 7^ then 
8: for i = fc + 1 to TO do 

9: dik ^ aik/ukk] 

10: for i = fc + 1 to TO do 

11: for = fc + 1 to TO do 

12: j ^ J k j : 

13: set Pij to 1 if TTi = J and otherwise, to form P; 

14: set Qij to 1 if tTj = 2 and otherwise, to form Q; 

15: set lij to aij if i > j, 1 if z = j, and otherwise, to form L; 

16: set Uij to aij a i < j and otherwise, to form U ; 



Q, using the techniques from the previous sections. 

Example 4.2. Consider the following decomposition (using Algorithm^ without 
pivoting): 
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Here, the sparsity pattern of the original matrix is lost completely in the L and U 
factors, and a fill-in of six new nonzeros is created. Suppose we permute the ma- 
trix by swapping the first and last rows and columns, then we obtain the following 
decomposition: 
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Now, L and U have the same sparsity pattern as the original matrix. Furthermore, 
performing Algorithm on the permuted matrix also required less work because of 
zeros that were preserved during decomposition: the two loops around line \lS\ can skip 
most rows and columns because either a, j. or a^j is equal to 0. 

4.1. Recursive BBD form. Following Example |4.2| we will try to bring the 
sparse matrix into Bordered Block Diagonal (BBD) form (25^ as illustrated in Fig. 
4.1 (a). The block in the lower-right corner is commonly called the Schur complement. 
A row that contains a nonzero in a column that intersects the first diagonal block and 
also in a column that intersects the second diagonal block is said to be cut or split 
with respect to the subdivision of the matrix. Cut columns are defined similarly. 
Performing Algorithm [5] on a matrix in such a form will only generate fill-in in the 
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(a) 



(b) 



(d) 



Al 
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A2 


As2 


Ac2 


^Sl 


aR 


As 
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^c& 


AR 

^Cl 


Ac2 


AR 


Ac 



Figure 4.1. Nested dissection for LU decomposition, (a) Bordered Block Diagonal (BBD) 
matrix form; (b) recursive BBD matrix form; (c) LU decomposition contributions; (d) matrix form 
for Algorithm^ 



shaded blocks and the costly inner loop can skip the empty blocks. By doing this 



recursively, see Fig. 4.1 (b), the amount of fill-in will be further reduced. This 
principle is called nested dissection |18j . 

Such a recursive matrix layout furthermore permits us to create a parallel LU 



decomposition algorithm (similar to [28 ). We illustrate this in Fig. 4.1 (c) where 
we are busy performing Algorithm [5] along the diagonal of diagonal block 1 (for the 
sake of simplicity we do not perform any pivoting) . For this nonzero on the diagonal, 
performing LU decomposition will only modify the darkly shaded parts of the matrix 
and therefore leave the diagonal blocks 2, 4, and 5 untouched. Furthermore, the LU 
decomposition contributions of all diagonal blocks to the Schur complements 3, 6, and 
7 do not depend on the actual values in the Schur complements, so we can perform 
LU decomposition on blocks 1, 2, 4, and 5 in parallel and add the contributions to the 
Schur complements afterwards. To process the nonzero on the diagonal of block 1, we 
need nonzero values not only from the cut rows and columns of Schur complement 3 
(solid arrows), but also from the rows and columns of Schur complement 7 (dashed 
arrows). So we need to keep track of all previous Schur complements during parallel 
LU decomposition. 

For the parallel LU algorithm outlined in Algorithm |6j we therefore work recur- 
sively on a matrix of the form illustrated in Fig. |4.1| (d). Here, we added contribution 
blocks Ac, ^ci' ^C2^ ^CS' ^ci' ^C2' ^^"^ ^cs ^^'^ matrix, which are indicated 
by a darker shade. At the start of Algorithm |6] (so for the original matrix A), these 
are empty, but as the algorithm further recurses, the contribution blocks will contain 
all the contributions of the LU decomposition to Schur complements of the previous 
recursion levels (i.e. the data required for the dashed arrows in Fig. |4.1| (c)). In an 
implementation of Algorithm |6] it would be efficient to store the nonzeros Oi j of the 
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matrix by increasing niin{z,j}. Thus, we keep all data necessary to perform the LU 
decomposition at line [6] together, regardless of the level of recursion. This is better 
than storing the nonzeros by increasing i (Compressed Row Storage, CRS) or j. 



Algorithm 6 Parallel recursive LU decomposition of the matrix from Fig. 4.1 (d) 



1: if we wish to continue subdividing then 

2: apply Algorithm [6] recursively and in parallel to 
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A'^ 


^Sl 








aR 
^Cl 









Ai 


^S2 




A^ 








A%-2 









3: add the contributions from A\ and A2 to Ag-, Ac, A^g, and A^g; 
4: perform LU decomposition (e.g. Algorithm Isl) on 



As 


Acs 


Acs 


Ac 



only permuting within Ag and stopping after factorizing Ag; 
5: else 

6: perform LU decomposition on A, only permuting within the lightly shaded part 
of the matrix and stopping after factorizing Ai, A2, and Ag; 



It is important to note that the LU decompositions performed in Algorithm [6] on 
line [4] and line [6] arc incomplete in the sense that they only treat part of the given 
matrix. In Algorithm [s] this would amount to specifying a number 1 < m' < m and 
letting k run from 1 to m' at line [s] and choosing i and j such that k < i, j < m' at 
line|4] To improve performance, multifrontal [TT] or supernodal [13] methods could 
be used to perform these LU decompositions. The condition at line [l] in Algorithm 
[6] is used to stop the algorithm whenever the resulting diagonal block becomes so 
small that a direct LU decomposition would outperform further recursion, or when 
there is a risk of the diagonal matrices becoming singular. Note that Algorithm |6] is 
processor-oblivious in the sense that we can continue recursing on the diagonal blocks 
while there are still more processors available, up to the recursion depth where the 
diagonal blocks are still sufficiently large. 

Since with this parallel method we can only pivot within each block (not doing 
so would destroy the recursive BED form), we could encounter a singular submatrix, 
as illustrated by Theorem |4.3| and Example |4.4[ 

Theorem 4.3. Let A € C°^°, B e C^^VC" e C=^= such that d := a- (b+c) > 
and A is of the form 
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B 







A^ 







c 
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D 
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//det(A) ^ 0, then 

b + c-d< rank (B) + rank (C) <b + c. 



(4.2) 



Proof. First of all, note that if the matrix A' G (jaxa obtained from A us- 
ing Gauss- Jordan elimination with column pivoting, then det(v4) = if and only if 
det{A') = 0. Suppose that det(A) ^ 0, then by performing these operations on B 
and C separately, we find the nonzero value 



det 



^rank(B) 



















-^rank (C) 
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^■ank(_B) 




























±det ^ 
















D 



The resulting smaller matrix has size a — rank (_B) — rank (C) and must be of maximum 
rank because its determinant is nonzero. Let e := a — rank (B) — rank (C) — d, then 
e>a — b— c— d = 0. Ife<d, a matrix with the above nonzero pattern can have 
maximum rank e + d. If e > d, the rank of such a matrix can be at most 2d < e + d. 
Therefore, it is necessary that e < d a — rank (B) — rank (C) — d < d 

a-2d < rank (B) + rank (C) <^=^ {b + c + d)-2d< rank (B) + rank (C), from 
which eqn (4.2) follows. □ 



Theorem |4 . 3 1 shows us that we cannot assume our diagonal blocks to be invertible 
whenever the Schur complement is nonempty. Furthermore, it motivates us to reduce 
the size of the Schur complement: this will increase the minimum rank that the 
diagonal blocks are required to have and thereby increases stability. In terms of 
hypergraph partitioning we therefore see that we should at all times try to make the 
Schur complement as small as possible: this will increase parallelism in the sense that 
more rows/columns can be treated in parallel by Algorithm [6j it will reduce fill-in, 
and it will improve stability^ 

To prevent the diagonal blocks from becoming singular we allow for an optional 
specification of a desired (strengthened) matrix diagonal beforehand [TS] , which will 
be preserved by the generated permutations as described in Section [4.2[ As Example 
4.4 shows however, this is not guaranteed to solve the problem. 

Example 4.4. Let 



A = 
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2/ 



B = 



C = 



c = 2, d = 1, 



Then a = 5, b = 2 

Therefore, the bound in eqn (4--2) is tight, 
that det{B) 



rank(B) = 1, rank(C) = 2, and dct(A) = 3 7^ 0. 
6 + c- d = 3 = rank (B) + rank (C) . Note 



even though the product of the diagonal elements of A is maximal. 



■^So hypergraph partitioners used for the purpose of bringing the matrix into recursive BBD form 
should use the cut-net metric instead of the (A — l)-metric, reducing the number of cut hyperedges, 
and not the associated communication volume. 
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During the performed benchmark with SuperLU (Table 5.1 ) we found that for 8 of 
the 28 matrices no pivoting was required at all (not even in the Schur complements), 
and in all other cases pivoting with a threshold of 10~^ was sufficient. Therefore, if we 
keep the Schur complements small, strengthen the matrix diagonal, and use threshold 
pivoting within diagonal blocks and Schur complements, we anticipate that submatrix 
singularity will not pose any significant problems in practice. 

4.2. Permutations. We will now apply the ideas from sections |l.l[ [2} and [3] to 

obtain the desired permutations to bring a given sparse matrix A e (jmxrn ^[^i^ 
nonzeros into recursive BBD form. This method will be referred to as visual matrix 
ordering (VMO). We assume the matrix to be square, because we want to use VMO 
for LU decomposition. 

Firstly, we need to determine what kind of hypergraph we will use to represent 



A (from Table 1.1). Using only the symmetric representation is not appropriate, 
because LU factorization is usually applied to unsymmetric matrices. The column- 
net and row-net approach often do not yield optimal partitionings when compared 
to the finegrain representation 8J. However, the finegrain representation results in 
nz vertices, thus degrading the performance of Algorithm [Sj which would scale as 
0(712 log(n2) -t- to). Inspection of the visual layouts revealed that a good layout for 
the finegrain representation could be obtained by laying out its dual (the bipartite 
representation, which is a graph) and then mapping each nonzero to the average of the 
points of the row and column belonging to that particular nonzero. As the bipartite 
representation has only 2 to vertices instead of nz, the layout can be generated much 
faster, scaling as 0{m log(TO) -I- nz). It also permits us easily to maintain a previously 
selected strengthened diagonal in the generated permutations 

Therefore, let G = (V, E) be the bipartite representation of our sparse matrix 
A. To avoid the problem of ending up with a singular matrix during recursion, we 
permit a desired strengthened diagonal to be specified with the matrix, in the form 
of a perfect bipartite graph matching MCE, [15]. We will view this matching as 
a map fj, : V V which maps each vertex v G V to fi{v) G V such that the edge 
{v, /i(w)} G M (as M is a perfect matching, exactly one vertex fJ.{v) has this property). 

We can also incorporate the values of the nonzeros of the matrix in the partitioning 
by setting the edge costs of G to \aij\ for each edge € E (optionally rescaling 

these values to a fixed interval to avoid convergence issues in Algorith m [3 ) . This is 



natural, because zeros of the matrix are not incorporated at all in eqn (2.1 ) (as they 
are not included in E), so letting the edge cost of {i, j} go to as \aij \ — > gradually 
decreases the influence of {i,j} on the energy function of G to zero. However, as this 
reduced the quality of the partitionings in terms of fill-in, we did not use this option 
for the performed experiments. 

Using Algorithm [2] we generate a visual representation of G. Then we apply 
Algorithm [t] to obtain V = Vi U V2 U V3 and E = Ei U E2 (J E3, where p{v) and 
q{e) denote the part indices for vertices v and edges e. Algorithm n\ turns the edge 
separator, obtained by partitioning the vertices using Algorithm Hi into a vertex 
separator. To do so we exploit the geometry of the partitioning by letting the vertices 
closest to the plane (described in Algorithm [t] by its normal r and distance S on 
line [7]) separating the two groups of vertices be chosen to be added to the vertex 
separator. We ensure that we preserve the strengthened diagonal in lines [Tl] and [13] 
this ensures that edges from the matching M are contained completely in either Vi, 
V2, or V3, which prevents them from entering the darker off-diagonal blocks in Fig. 



4.2 This can be skipped if no matching is available or desired (e.g. in the context of 
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Algorithm 7 Given a graph G = (V, E) with visual representation x : V —?' IV^ 
and a map fi : V ^ V derived from a perfect bipartite graph matching M C iJ, 
this algorithm partitions V and -E into Vi , V2 , V3 and Ei,E2, -E3 respectively, where 
no {iijw} G E exists with w G Fi and w & V2, and such that e € £^1 — >■ e C Vi, 



e e £^2 ^ e C ^2, and e e £^3 ^ e n 7^ (Fig. 4.2 (left)) 



1: 
2: 
3: 
4: 
5: 
6: 
7: 
8: 
9: 
10: 
11: 
12: 
13: 
14: 
15: 
16: 
17: 



determine two centers zi, 22 £ R*^ in a;(y) C R"* by Algorithm [ij 
for all w € y in parallel do 

if \\x{v) — zi\\ < \\x{v) — Z2\\ then 

p{v) ^ 1; 
else 

p{v) ^ 2; 
r ^ Z2 - zi; S ^ ^{z2 + zi) ■ r; 
for all e = {u, £ E do 
if {p{v),p{w)} = {1,2} then 

if \x{v) ■ r — 5\ < \x{w) ■ r — S\ then 

p{v) i~ 3; p{fi{v)) 4- 3; 
else 

p(w) ^ 3; p{fi{w)) ^ 3; 
for all e — {v,w} G E m parallel do 

g(e) ^ m.ax.{p(v),p{w)}; 
sort the vertex pairs {v, niv)} by their p-values to obtain Vi, V2, T^3; 
sort the edges by their values to obtain Ei, E2, E^; 
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E3 


E3 
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Figure 4.2. Matrix partitioning for Algorithm^ (Isft) md improved partitionings obtained by 
bringing either the Schur complement (middle) or the cut rows and columns (right) into BBD form. 



sparse matrix- vector multiplication instead of LU decomposition), which will result 
in smaller V3 and E3. However, for LU decomposition it is necessary, in particular to 
maintain square blocks on the diagonal. After the first iteration of Algorithm [7] we 
again apply it to Gi = (Vi, i?i) and G2 — {V2, E2) and continue doing this recursively 
to obtain recursive BBD permutations for our matrix A as shown in the rightmost 
column of Fig. 

The permutations themselves can directly be obtained from the recursive parti- 



tioning of V: the rows and columns of the block Ei (see Fig. 4.2) are exactly the 
vertices in Vi, and similarly for the rows and columns of the blocks E2 and £3. There- 
fore, a simple linear walk through the reordered vertices (line 16 of Algorithm [?]) will 
provide the proper permutations of the rows and columns of our matrix A. 

When permuting the matrix to recursive BBD form, we have additional freedom 
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in permuting the rows and columns of V3 in Fig. 4.2 (left) (also see Fig. |5.1| ). A direct 
way to do this is also to apply Algorithm [t] recursively to G3 = (V3, E'^), just like we 
do for Gi and G2. Here E'^ consists of all edges e G E3 satisfying e C V3 (so E'^ is 



the lightly shaded E3 part of Fig. 4.2 (left)). This gives permutations as illustrated 
in Fig. 4.2 (middle). An advantage of this method is that the strengthened diagonal 
is also maintained within the Schur complement. 

Another way in which the additional freedom can be used, is to bring the cut 



rows and columns into recursive BBD form as illustrated in Fig. 4.2 (right). Doing 
this is a little more tricky: first of all, we assign a two-bit number to each vertex in 
V3, initially set to 00. We also keep track of the edges E13 between Vi and V3, and 
edges E23 between V2 and V3. Now, if Vi (i — 1,2) is split with Algorithm [t] into 
Vii, Vi2, and Vi^ we can loop through all edges {v,w} in Ei^ with w £ V3. Then if 
u G 14 1 we set the first bit of the number associated with w and ii v £Vi2 we set the 
second bit. If we do this for both splits of Vi and V2, and then sort the vertices in V3 



by their two-bit numbers we obtain a permutation as shown in Fig. 4.2 (right). By 
expanding these numbers to b two-bit pairs and keeping track of the edges extending 
to the Schur complements for up to b splits, we can bring the cut rows and columns 
into recursive BBD form up to the 6th level. 

5. Experiments. We implemented the VMO algorithm in C-|— I- using the In- 
tel Threading Building Blocks library for many-core parallelism where we chose to 
generate visual representations in d = 4 dimensions to be able to perform all parallel 
vector calculations in Algorithm [s] and Algorithm |4] efficiently on either the CPU (one 
Streaming SIMD xmm* register for a point in R^) or the GPU (a f loat4 register for 
a point in R^). This furthermore ensures that we do not need to take a square root 



in eqn (2.5 1 



To measure the quality of the generated permutations we compared VMO to the 



Super LU [13] LU decomposition package by measuring fill-in, see Table 5.1 In this 
case we made use of the additional freedom in the cut rows and columns by also 
recursively subdividing the cut parts of the graph while retaining the strengthened 



diagonal (Fig. 4.2 (middle) and Fig. 5.1 (left)) to ensure that few small pivots are 



encountered along the diagonal. We performed four decompositions for each matrix 
where we used permutations generated by VMO, as well as the built-in COLAMD(A), 
MMD(A^ + A), and MMD(A^v4) column permutations generated by SuperLU. For 
the permutations generated by SuperLU we retained the default SuperLU 4.1 options, 
while for the VMO permutations we first performed a run without any pivoting and 
then a run with threshold pivoting using a value oi u — 10^^. To ensure we would 
not run into numerical problems we used a strengthened diagonal obtained via a heavy 
edge matching in the bipartite representation of A, augmented to a perfect matching 
via the Hopcroft-Karp algorithm [23]. We furthermore validated the decomposition 
by comparing calculated condition numbers for all permutation methods and letting 
SuperLU calculate the backward error of the solution to Ax = h obtained by solving 
the system using the decomposition A — LU (section 3.1 of [19 ) for b = A{1, . . . , 1)-^. 



Table 5.1 shows that in 8 of the 28 cases, decomposition of the matrices permuted by 
VMO did not require any pivoting at all, not even in the Schur complements. From 
the table we see that VMO compares favorably with SuperLU: looking at the lowest 
fill-in of COLAMD(A), MMD(A^ -t- A), and MMD(A^ A) and the fill-in of VMO for 



^SuperLU performs row pivoting by generating a row permutation tt such that for all 1 < i < m, 
If^TrO) jl ^ ^ ™axi<i<„ \ciij\ where u £ [0, 1] is the desired threshold, see 1141 eqn (4.4.7)]. 
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Matrix 


Size 


Nonzeros 


VMO 


CMD 


MMD+ 


MMDx 


swangl 


3169 


20841 


6.2 


7.7 


6.7 


8.2 


lns_3937 


3937 


25407 


15.0 


17.5 


132.2 


17.5 


poli_large 


15575 


33074 


1.6 


1.6 


1.6 


1.6 


marks jac020* 


9129 


56175 


68.1 


45.6 


121.3 


43.9 


fdl8* 


16428 


63406 


21.9 


24.1 


302.0 


25.5 


lhr04* 


4101 


82682 


6.0 


4.1 


20.6 


4.3 


raef sky6 


3402 


137845 


2.7 


3.4 


4.5 


3.1 


shermanACb* 


18510 


145149 


19.0 


45.3 


14.3 


57.2 


bayer04* 


20545 


159082 


10.2 


4.2 


41.8 


4.2 


Zliao2* 


33861 


166453 


158.1 


115.1 


1280.1 


107.0 


mult_dcop_03 


25187 


193216 


3.1 


2.0 


3.4 


5.9 


jan99jacl20sc* 


41374 


260202 


71.8 


15.9 


52.4 


19.7 


bayerOl* 


57735 


277774 


7.5 


5.4 


47.6 


5.6 


sincl2* 


7500 


294986 


37.8 


44.7 


36.3 


45.3 


onetonel* 


36057 


341088 


32.1 


14.4 


149.0 


14.2 


mark3jacl40sc* 


64089 


399735 


111.0 


125.7 


4435.0 


152.0 


af23560 


23560 


484256 


24.8 


25.0 


82.7 


26.9 


e40r0100* 


17281 


553562 


9.2 


9.2 


137.5 


8.4 


sincl5* 


11532 


568526 


56.3 


58.0 


48.7 


57.2 


Zd_Jac2_db* 


22835 


676439 


9.6 


5.1 


32.1 


5.7 


llir34c* 


35152 


764014 


7.0 


4.7 


50.5 


4.7 


sinclS* 


16428 


973826 


65.7 


67.8 


68.2 


72.3 


torso2 


115967 


1033473 


10.2 


16.8 


8.2 


14.5 


twotone 


120750 


1224224 


35.8 


15.2 


1448.1 


17.0 


lhr71c* 


70304 


1528092 


6.7 


4.8 


66.4 


4.7 


av41092* 


41092 


1683902 


64.6 


26.0 


177.6 


23.8 


bbmat* 


38744 


1771722 


32.0 


26.7 


1000.6 


26.8 



Table 5.1 



Comparison between VMO and SuperLU 4.1 in terms of fill-in, defined as {nz{L) + nz(U) — 
nz{I))/nz{A) for A = L U. The best result for each matrix is bold, CMD = COLAMD, MMD+ = 
MMD(A'^ +A), and MMDx = MMD(A'^ A) are the column pre-orderings determined by SuperLU. 
Matrices marked with * required threshold 10~* pivoting for VMO. 



each of the 28 test matrices, we find that on average the fill-in of VMO equals 1.52 
times the lowest fill-in of the other methods, and that VMO outperforms all other 
methods in 8 cases. This indicates that the permutations generated by VMO are 
useful in the context of sparse LU decomposition. 

We also compared VMO with Mondriaan |31j in terms of matrix partitioning. 
Firstly, we did this in the context of cache-oblivious sparse matrix-vector multiplica- 
tion where the matrices are permuted into recursive Separated Block Diagonal (SBD) 
form [33] (with the cut rows and columns in the middle instead of at the end) to 
decrease the number of cache-misses, independent of the particular cache hierarchy 
of the processor performing the multiplication. Results are further improved by also 
using the additional freedom in the cut rows and columns to bring these into recursive 



SBD form (Fig. 5.1 (right)). We measure the matrix multiplication time with the 
same program and on the same processor as [34| : a single node of the Huygens su- 
percomputer equipped with a dual-core 4.7GHz IBM Power6-t- processor with 64kB 
LI cache per core, a semi-shared L2 cache of 4MB, and an L3 cache of 32MB on 
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which the matrix-vector muhipUcation program has been compiled with the IBM XL 
compiler. In Table |5.2[ we compare the matrix-vector multiplication time for the 
original matrix with the best result from [34' (where the matrix has been permuted 
by Mondriaan), and with the result obtained by using VMO. 



Matrix 


Rows 


Columns 


Nonzeros 


Orig. 


m 


VMO 


ex37 




3565 


3565 


67591 


0.116 


0.113 


0.113 


memplus 




17758 


17758 


126150 


0.308 


0.300 


0.280 


rhpentium 




25187 


25187 


258265 


0.645 


0.627 


0.646 


lhr34 




35152 


35152 


764014 


1.37 


1.34 


1.34 


lp_nug30 




52260 


379350 


1567800 


5.35 


4.85 


9.15 


s3dkt3m2 




90449 


90449 


1921955 


7.81 


7.27 


7.80 


tbdlinux 




112757 


21067 


2157675 


6.43 


5.03 


5.66 


Stanford 




281903 


281903 


2312497 


19.0 


9.35 


5.88 


stanf ord_berkeley 


683446 


683446 


7583376 


20.9 


19.2 


22.5 


wikipedia- 


20051105 


1634989 


1634989 


19753078 


249 


116 


128 


cagel4 




1505785 


1505785 


27130349 


69.4 


74.4 


99.0 


wikipedia- 


20060925 


2983494 


2983494 


37269096 


688 


256 


264 



Table 5.2 

Comparison with Mondriaan in the context of cache-oblivious sparse matrix-vector multiplica- 
tion \34^ . We compare the original matrix-vector multiplication time with the best time from Jg^F 
(which used Mondriaan 3.01 for reordering) and the best time with VMO. 



VMO performs poorly for lp_nug30 and cagel4. For lpjnug30, this can be ex- 
plained by a lack of underlying geometrical structure: the visual representation of this 
matrix is a featureless blob from which little extra information can be obtained, result- 
ing in quite bad permutations. The matrix cage 14 already possesses a nonzero layout 
that is well suited for matrix-vector multiplication: both Mondriaan and VMO fail to 
improve the matrix-vector multiplication time. For tbdlinux, wikipedia-20051105, 
and wikipedia-20060925 VMO shows improvements comparable to those of Mon- 
driaan, while for memplus and Stanford the results are even better. As generating 



the permutations with VMO is much faster (see Table 5.3 1, this makes VMO a viable 
alternative to Mondriaan in this context. 

Another comparison with Mondriaan was made in terms of the cut-net metric, 
which is the appropriate metric in the context of LU decomposition because of The- 
orem |4.3[ Hence, we look at the maximum number of cut rows and columns in all 
matrix (sub)divisions. While Mondriaan divides the matrix among a given number 
of processors, VMO continues subdividing the matrix until it can no longer continue. 
Therefore, we ran Mondriaan with a hybrid splitting strategy for the cut-net met- 
ric to divide the matrix into 256 parts with a permitted imbalance of 0.1 to obtain 
permutations comparable to those of VMO. 

In Table |5.3[ we measure the time it takes Mondriaan to perform the matrix par- 
titioning and divide this by the time it takes VMO to generate a visual representation 
of the matrix and to generate a partitioning from this visual representation. All tim- 
ings except for those marked with * were measured on a system with a quad-core 
2.8GHz Intel Core 17 860 processor and 8GB RAM, in particular to illustrate the 
gains of using VMO on a many-core system. The entries marked with * needed to be 
benchmarked on a different system, because of Mondriaan's memory requirements: 
these were performed on a dual quad-core 2.4GIIz AMD Opteron 2378 system with 



32GB RAM. From Table 5.3 we find that VMO is on average 21.6 times faster than 
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Figure 5.1. The matrices rhpentium (left) and ujikipedia-20070206 (right), permuted by VMO 
to recursive BBD an d rec ursive SBD form, respectively. Additional permut ation freedom is used for 
rhpentium as in Fig. \4.S\ (middle), and for wikipedia-20070206 as in Fig. \4.S\ (right). 



Matrix 


Speedup 


Speedup 


Relative 




V + 





cut 


ex37 


13.4 


53.2 


0.39 


memplus 


2.1 


11.0 


2.86 


rhpentium 


14.5 


57.0 


1.08 


lhr34 


6.5 


39.8 


2.47 


lp_nug30 


9.1 


144.7 


2.20 


s3dkt3m2 


4.7 


26.7 


1.02 


tbdlinux 


25.0 


228.3 


0.75 


Stanford 


7.9 


78.6 


2.37 


stanf ord_berkeley 


17.0 


118.0 


3.26 


wikipedia-20051105 


36.0 


290.7 


0.61 


cagel4* 


4.3 


29.1 


0.62 


wikipedia-20060925* 


119.4 


1104.3 


1.02 



Table 5.3 



Comparison with Mondriaan in terms of the calculation time and the largest number of cut 
rows/columns in a split. Speedup is defined as the time required by Mondriaan 3.01 to perform the 
matrix partitioning, divided by the time required by VMO to generate both the visual representation 
and permutations (V + O) or just the permutations (O). The last column gives the maximum of 
the number of cut rows and columns in all splits of VMO, divided by the maximum obtained by 
Mondriaan. Entries marked with a * were benchmarked on a different system because of a lack of 
memory. The test matrices are the same as those from Table [571| 



Mondriaan, and if for all matrices a visual representation would already have been 
given the average speedup would even be 181.8. We also measure the maximum of the 
number of cut rows and columns in all subdivisions of the matrix for VMO and divide 
this by the maximum for Mondriaan. This gives a measure for the relative maximum 
cut size when comparing the two methods: the maximum cut size obtained by VMO 
is on average 1.55 times that of Mondriaan and in four cases it is less. To make the 
comparison as fair as possible we used Mondriaan with the cut-net metric for par- 
titioning, but it should still be remarked that minimizing the maximum number of 
cut rows and columns is not the primary objective of Mondriaan and the balancing 
restrictions placed on Mondriaan are absent for VMO. 
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6. Conclusion. We have shown that it is possible to create and use the visual 
representations of hypergraphs to generate partitionings and orderings which are of 



sufficient quality for sparse LU decomposition (Table 5.1) and sparse matrix-vector 
multiplication (Table [5^ . Our method generates orderings on average 21.6 times 
faster than Mondriaan (Table [sTs] ). We generalized the 2D/3D graph visualization 
method from [241 to generate hypergraph geometries in higher dimensions. Further- 
more, the algorithms to generate visual representations (Algorithm [3]) and matrix 
orderings (Algorithm [7| are well suited to shared-memory many-core parallel archi- 
tectures such as current many-core CPUs and CPUs. We have implemented these 
algorithms in the software package VMO. 

This also opens up opportunities for further research, such as moving from a 
shared-memory parallel architecture to distributed-memory, which would require sig- 
nificant modifications of Algorithm |3] Algorithm [7] and the data structures involved. 
Since VMO is fast and parallel, it also has potential to remove computational par- 
titioning bottlenecks in large applications such as the human bone simulations in 
0. 
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